This study takes a data-driven approach to explore the degree to which the reactor calibration settings and the composition of the feed hydrocarbon mixture influence the yield of hydrocarbon in a target range of densities.
Hydrocarbons are chemical compounds with molecules consisting of hydrogen and carbon atoms. Hydrocracking is an industrial process in which hydrocarbons with molecules made of long chains of atoms are broken down into smaller molecules. This is done by the addition of hydrogen at high pressures and temperatures in the presence of a chemical catalyst. This process is used in the petrochemical industry to increase the proportion of extracted hydrocarbons that have shorter molecules, which are typically more useful to consumers and attract a higher profit on sale.
The process of hydrocracking can be controlled by adjusting the temperature of the reactor (or, equivalently, its pressure), the catalyst that is used and the time for which the hydrocarbon mixture remains in the reactor. The composition of the effluent hydrocarbon mixture depends on these reactor settings and on the composition of the feed hydrocarbon mixture that entered the reactor. The most useful and valuable hydrocarbons are neither the heaviest (most dense, with the longest molecules) nor the lightest (least dense, with the smallest molecules), but those with a density in a specific, intermediate range, known as the target range.
This study uses a qualitative and quantitative approach to assess which of the control variables are most influential on the yield of hydrocarbon in the target range of densities. We also predict the yield in this range for a given feed composition and reactor calibration. This would allow us not only to maximise profits, but also to minimise waste by calibrating reactor output to meet, but not exceed, demand.
The data contain observations of 44 variables over 497 days, detailing the reactor settings and the feed composition. In particular:
## Rows: 497 Columns: 44
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (43): catalyst, temperature, through_time, feed_fraction_1, feed_fracti...
## date (1): date
| date | catalyst | temperature | through_time | feed_fraction_1 | feed_fraction_2 | feed_fraction_3 | feed_fraction_4 | feed_fraction_5 | feed_fraction_6 | feed_fraction_7 | feed_fraction_8 | feed_fraction_9 | feed_fraction_10 | feed_fraction_11 | feed_fraction_12 | feed_fraction_13 | feed_fraction_14 | feed_fraction_15 | feed_fraction_16 | feed_fraction_17 | feed_fraction_18 | feed_fraction_19 | feed_fraction_20 | out_fraction_1 | out_fraction_2 | out_fraction_3 | out_fraction_4 | out_fraction_5 | out_fraction_6 | out_fraction_7 | out_fraction_8 | out_fraction_9 | out_fraction_10 | out_fraction_11 | out_fraction_12 | out_fraction_13 | out_fraction_14 | out_fraction_15 | out_fraction_16 | out_fraction_17 | out_fraction_18 | out_fraction_19 | out_fraction_20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-10-15 | 2 | 430.4965 | 14.4852 | 0.1665249 | 0.1067418 | 0.1445298 | 0.0996205 | 0.0618106 | 0.0740259 | 0.0806809 | 0.0379410 | 0.0398031 | 0.0426279 | 0.0176451 | 0.0113143 | 0.0398534 | 0.0180133 | 0.0131453 | 0.0093708 | 0.0148047 | 0.0143313 | 0.0037179 | 0.0034976 | 0.0469721 | 0.0699713 | 0.1127505 | 0.1045085 | 0.1006671 | 0.1086029 | 0.0802410 | 0.0799082 | 0.0443304 | 0.0498234 | 0.0316218 | 0.0387775 | 0.0104624 | 0.0173334 | 0.0249919 | 0.0241984 | 0.0133429 | 0.0117161 | 0.0106210 | 0.0187093 |
| 2020-10-16 | 0 | 561.7924 | 9.0106 | 0.1674887 | 0.1137283 | 0.1457275 | 0.0754686 | 0.0975185 | 0.0554127 | 0.0687447 | 0.0472354 | 0.0380798 | 0.0248395 | 0.0384805 | 0.0183524 | 0.0354908 | 0.0141362 | 0.0257403 | 0.0102525 | 0.0060617 | 0.0098716 | 0.0043457 | 0.0030245 | 0.1330713 | 0.1069881 | 0.1639031 | 0.0824701 | 0.0720573 | 0.0814050 | 0.0607400 | 0.0527927 | 0.0496219 | 0.0258447 | 0.0374327 | 0.0209244 | 0.0287991 | 0.0206835 | 0.0275374 | 0.0087909 | 0.0089273 | 0.0081185 | 0.0059386 | 0.0039278 |
| 2020-10-17 | 0 | 586.2617 | 14.2961 | 0.1530465 | 0.1271823 | 0.1333010 | 0.1264819 | 0.0725774 | 0.0868296 | 0.0633655 | 0.0264019 | 0.0662125 | 0.0218494 | 0.0196854 | 0.0148009 | 0.0171536 | 0.0206254 | 0.0241243 | 0.0022259 | 0.0053160 | 0.0096257 | 0.0018747 | 0.0073200 | 0.1218003 | 0.1178157 | 0.1244867 | 0.1484372 | 0.0608819 | 0.1039893 | 0.0593819 | 0.0396106 | 0.0536584 | 0.0407452 | 0.0213145 | 0.0094543 | 0.0191581 | 0.0214941 | 0.0171699 | 0.0137099 | 0.0068070 | 0.0058209 | 0.0049325 | 0.0092757 |
| 2020-10-18 | 1 | 886.6055 | 11.8744 | 0.1515063 | 0.1245180 | 0.1454614 | 0.0908515 | 0.1178410 | 0.0899024 | 0.0322081 | 0.0332607 | 0.0518593 | 0.0406130 | 0.0311355 | 0.0133577 | 0.0218595 | 0.0105603 | 0.0047110 | 0.0188433 | 0.0018196 | 0.0057411 | 0.0107530 | 0.0031976 | 0.0418474 | 0.1275574 | 0.0931776 | 0.1200447 | 0.1468649 | 0.0923605 | 0.0535594 | 0.0565904 | 0.0482816 | 0.0410244 | 0.0370487 | 0.0289327 | 0.0256415 | 0.0211631 | 0.0134359 | 0.0140301 | 0.0089632 | 0.0067843 | 0.0081899 | 0.0141709 |
| 2020-10-19 | 0 | 771.2356 | 9.9849 | 0.1700570 | 0.1423075 | 0.0706251 | 0.0781760 | 0.0922736 | 0.0868439 | 0.0562915 | 0.0563060 | 0.0253055 | 0.0273335 | 0.0356485 | 0.0431528 | 0.0278995 | 0.0144348 | 0.0183639 | 0.0106504 | 0.0187599 | 0.0117113 | 0.0129416 | 0.0009176 | 0.0786254 | 0.1184928 | 0.1241685 | 0.0952780 | 0.0802737 | 0.0807306 | 0.0740011 | 0.0530756 | 0.0530620 | 0.0299637 | 0.0347347 | 0.0325937 | 0.0404441 | 0.0203238 | 0.0158949 | 0.0167682 | 0.0142433 | 0.0162894 | 0.0122491 | 0.0085575 |
The variable date is of no use for the purposes of this study, since it does not affect the effluent mixture. The variable catalyst is categorical in nature rather than numerical. The density intervals range from 1 to 20: interval 1 corresponds to the longest, heaviest molecules and interval 20 to the shortest, lightest molecules. This study targets density intervals 13, 14 and 15, which we will refer to as the response variables.
Next, we create the final data frame by dropping the column date and converting the variable catalyst to a factor.
#Clean data --------------------------------------------------------------------
# Omit date column
hydro <- hydro %>% select(-date)
#Convert catalyst to factor
hydro$catalyst <- as.factor(hydro$catalyst)
First, we take a brief look at the summary statistics of our data. We note that there are almost exactly the same number of observations for each catalyst.
## catalyst temperature through_time feed_fraction_1 feed_fraction_2
## 0:167 Min. :400.2 Min. : 5.334 Min. :0.1053 Min. :0.08522
## 1:162 1st Qu.:557.4 1st Qu.: 8.139 1st Qu.:0.1531 1st Qu.:0.12642
## 2:168 Median :678.6 Median : 9.857 Median :0.1701 Median :0.14023
## Mean :696.6 Mean : 9.937 Mean :0.1710 Mean :0.14193
## 3rd Qu.:849.4 3rd Qu.:11.625 3rd Qu.:0.1886 3rd Qu.:0.15566
## Max. :999.0 Max. :14.923 Max. :0.2536 Max. :0.23784
## feed_fraction_3 feed_fraction_4 feed_fraction_5 feed_fraction_6
## Min. :0.07063 Min. :0.04863 Min. :0.03958 Min. :0.02719
## 1st Qu.:0.10600 1st Qu.:0.08505 1st Qu.:0.06995 1st Qu.:0.05552
## Median :0.12032 Median :0.09775 Median :0.07985 Median :0.06794
## Mean :0.12143 Mean :0.09888 Mean :0.08184 Mean :0.06860
## 3rd Qu.:0.13615 3rd Qu.:0.11103 3rd Qu.:0.09227 3rd Qu.:0.08017
## Max. :0.17720 Max. :0.16820 Max. :0.14348 Max. :0.13430
## feed_fraction_7 feed_fraction_8 feed_fraction_9 feed_fraction_10
## Min. :0.02369 Min. :0.01086 Min. :0.01338 Min. :0.007345
## 1st Qu.:0.04705 1st Qu.:0.03754 1st Qu.:0.03177 1st Qu.:0.024590
## Median :0.05636 Median :0.04650 Median :0.03923 Median :0.032345
## Mean :0.05763 Mean :0.04744 Mean :0.04057 Mean :0.033352
## 3rd Qu.:0.06749 3rd Qu.:0.05553 3rd Qu.:0.04836 3rd Qu.:0.040458
## Max. :0.10640 Max. :0.10839 Max. :0.08508 Max. :0.076306
## feed_fraction_11 feed_fraction_12 feed_fraction_13 feed_fraction_14
## Min. :0.006816 Min. :0.004157 Min. :0.003287 Min. :0.001076
## 1st Qu.:0.019611 1st Qu.:0.015306 1st Qu.:0.012849 1st Qu.:0.010324
## Median :0.026044 Median :0.022244 Median :0.017157 Median :0.014951
## Mean :0.027470 Mean :0.022715 Mean :0.018748 Mean :0.015800
## 3rd Qu.:0.033513 3rd Qu.:0.028397 3rd Qu.:0.023957 3rd Qu.:0.020364
## Max. :0.072983 Max. :0.060246 Max. :0.046293 Max. :0.056730
## feed_fraction_15 feed_fraction_16 feed_fraction_17
## Min. :0.0006381 Min. :0.0005532 Min. :0.0003752
## 1st Qu.:0.0074735 1st Qu.:0.0061035 1st Qu.:0.0046891
## Median :0.0118727 Median :0.0094685 Median :0.0083000
## Mean :0.0129535 Mean :0.0107693 Mean :0.0093032
## 3rd Qu.:0.0166829 3rd Qu.:0.0141829 3rd Qu.:0.0124153
## Max. :0.0471171 Max. :0.0456827 Max. :0.0435436
## feed_fraction_18 feed_fraction_19 feed_fraction_20 out_fraction_1
## Min. :7.297e-05 Min. :0.000112 Min. :0.0000142 Min. :0.007195
## 1st Qu.:3.301e-03 1st Qu.:0.002492 1st Qu.:0.0016774 1st Qu.:0.049193
## Median :5.949e-03 Median :0.005302 Median :0.0038044 Median :0.073570
## Mean :7.603e-03 Mean :0.006865 Mean :0.0051271 Mean :0.078046
## 3rd Qu.:1.067e-02 3rd Qu.:0.009655 3rd Qu.:0.0074127 3rd Qu.:0.101155
## Max. :3.049e-02 Max. :0.040244 Max. :0.0249957 Max. :0.224488
## out_fraction_2 out_fraction_3 out_fraction_4 out_fraction_5
## Min. :0.04044 Min. :0.05488 Min. :0.05037 Min. :0.04609
## 1st Qu.:0.09513 1st Qu.:0.11137 1st Qu.:0.09792 1st Qu.:0.08401
## Median :0.11849 Median :0.12511 Median :0.11300 Median :0.09626
## Mean :0.11787 Mean :0.12727 Mean :0.11478 Mean :0.09768
## 3rd Qu.:0.14116 3rd Qu.:0.14332 3rd Qu.:0.12708 3rd Qu.:0.11139
## Max. :0.23553 Max. :0.20614 Max. :0.22606 Max. :0.17503
## out_fraction_6 out_fraction_7 out_fraction_8 out_fraction_9
## Min. :0.03587 Min. :0.02537 Min. :0.02073 Min. :0.01798
## 1st Qu.:0.06831 1st Qu.:0.05769 1st Qu.:0.04716 1st Qu.:0.03917
## Median :0.07906 Median :0.06756 Median :0.05536 Median :0.04747
## Mean :0.08104 Mean :0.06865 Mean :0.05725 Mean :0.04792
## 3rd Qu.:0.09268 3rd Qu.:0.07847 3rd Qu.:0.06581 3rd Qu.:0.05563
## Max. :0.15221 Max. :0.13340 Max. :0.10460 Max. :0.08502
## out_fraction_10 out_fraction_11 out_fraction_12 out_fraction_13
## Min. :0.01240 Min. :0.007842 Min. :0.007249 Min. :0.005467
## 1st Qu.:0.03226 1st Qu.:0.027018 1st Qu.:0.021213 1st Qu.:0.017043
## Median :0.03932 Median :0.032290 Median :0.026944 Median :0.022292
## Mean :0.03964 Mean :0.033326 Mean :0.027245 Mean :0.022625
## 3rd Qu.:0.04593 3rd Qu.:0.039071 3rd Qu.:0.032445 3rd Qu.:0.026705
## Max. :0.09348 Max. :0.059683 Max. :0.057851 Max. :0.046839
## out_fraction_14 out_fraction_15 out_fraction_16 out_fraction_17
## Min. :0.001664 Min. :0.003675 Min. :0.002147 Min. :0.001796
## 1st Qu.:0.014493 1st Qu.:0.012234 1st Qu.:0.009327 1st Qu.:0.007672
## Median :0.017935 Median :0.015367 Median :0.012239 Median :0.010003
## Mean :0.018864 Mean :0.015754 Mean :0.012815 Mean :0.010792
## 3rd Qu.:0.022658 3rd Qu.:0.018938 3rd Qu.:0.015600 3rd Qu.:0.013421
## Max. :0.059309 Max. :0.036352 Max. :0.034909 Max. :0.032140
## out_fraction_18 out_fraction_19 out_fraction_20
## Min. :0.0006285 Min. :0.0006014 Min. :0.000653
## 1st Qu.:0.0062031 1st Qu.:0.0049325 1st Qu.:0.006485
## Median :0.0082959 Median :0.0072395 Median :0.010341
## Mean :0.0090530 Mean :0.0078201 Mean :0.011373
## 3rd Qu.:0.0116719 3rd Qu.:0.0100440 3rd Qu.:0.015239
## Max. :0.0252392 Max. :0.0257602 Max. :0.045080
Next, we compute summary statistics of the temperature and of the residence time of the mixture in the reactor (in hours), for each catalyst.
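A minimal sketch of how these per-catalyst summaries could be produced with dplyr (an assumption; the original chunk is not shown):

#Per-catalyst summaries (hedged sketch, assumed code)
temp_summary <- hydro %>%
  group_by(catalyst) %>%
  summarise(mean = mean(temperature), median = median(temperature),
            var = var(temperature), max = max(temperature), min = min(temperature))
time_summary <- hydro %>%
  group_by(catalyst) %>%
  summarise(mean = mean(through_time), median = median(through_time),
            var = var(through_time), max = max(through_time), min = min(through_time))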
Temperature by catalyst:

| catalyst | mean | median | var | max | min |
|---|---|---|---|---|---|
| 0 | 683.1146 | 665.2807 | 28500.36 | 996.9258 | 400.1571 |
| 1 | 700.5558 | 679.5236 | 27857.04 | 998.9981 | 400.8394 |
| 2 | 706.1049 | 706.7726 | 30494.66 | 997.1580 | 404.2195 |
Residence time (hours) by catalyst:

| catalyst | mean | median | var | max | min |
|---|---|---|---|---|---|
| 0 | 9.688389 | 9.55910 | 5.368002 | 14.7267 | 5.4070 |
| 1 | 10.086341 | 10.07815 | 5.096124 | 14.9226 | 5.5036 |
| 2 | 10.039718 | 9.97650 | 5.040449 | 14.8399 | 5.3338 |
We visualise these results in the figure below. The box plots show the temperature and the residence time for each catalyst. In the first panel, we observe that the median, maximum and minimum values differ only slightly between catalysts. The distributions appear symmetric and roughly uniform, and there are no outliers.
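A sketch of how these box plots could be drawn (assumed; the plotting chunk itself is not shown):

#Box plots of reactor settings per catalyst (hedged sketch, assumed code)
p_temp <- ggplot(hydro, aes(x = catalyst, y = temperature)) +
  geom_boxplot() + theme_bw() + ggtitle("Temperature per catalyst")
p_time <- ggplot(hydro, aes(x = catalyst, y = through_time)) +
  geom_boxplot() + theme_bw() + ggtitle("Residence time per catalyst")
grid.arrange(p_temp, p_time, ncol = 2)  # gridExtra, as used later in the report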
The distribution plot below shows the feed mixtures for selected density intervals. It can be clearly seen that the variables are approximately normally distributed and that the mean decreases as the density interval increases. What is striking is that the variation decreases as well, with the values clustering more tightly around the mean. One reason that can explain this is that, as the size and weight of the molecules decrease, it is harder to control their proportions. Some outliers also appear as the density interval increases.
The next figures show the distribution of the proportions of the effluent mixture for the same intervals, together with the response variables. Again, we observe the same overall trend. The mean value for interval 5 has increased slightly, while for interval 1 there is a significant decrease. In particular, for interval 15 there is a sharp increase in the amount of data clustered around the mean, in contrast with interval 20, where there is a sharp decrease.
The next plots show the summary statistics of the effluent mixture in intervals 13, 14 and 15 for each catalyst (0, 1, 2). Closer inspection of the box plots shows that there is an increase in proportion when using catalysts 1 and 2 as opposed to catalyst 0. There is no significant difference between catalysts 1 and 2, as the different intervals have roughly the same median and range in the two cases.
The scatter plots below show the relationship between temperature (left), residence time (right) and the mixture proportions for intervals 13, 14 and 15, as well as their sum. The lines represent regression curves calculated using the LOESS method. In short, this method fits the regression locally, taking into account only nearby points, in contrast with a standard linear regression model, which uses all points. As shown in the last graph, the sum of proportions remains relatively steady as the temperature increases. There is a slight decrease in proportion between 400°F and 680°F, but temperature does not appear to dramatically affect the response variables. The right-hand graph shows a gradual increase in the proportion as the residence time of the mixture in the reactor increases.
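A sketch of the kind of plot described above, here for the sum of the target intervals against temperature with a LOESS smoother (assumed code; the residence-time panel is analogous):

#Scatter plot with LOESS smoother (hedged sketch, assumed code)
target_sum <- hydro$out_fraction_13 + hydro$out_fraction_14 + hydro$out_fraction_15
ggplot(data.frame(temperature = hydro$temperature, target_sum = target_sum),
       aes(x = temperature, y = target_sum)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess") +  # local regression rather than one global line
  theme_bw() +
  labs(x = "Temperature", y = "Proportion in intervals 13-15")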
# Prediction

Having explored our variables, this section discusses the methods used to determine which of the control variables are most influential on the yield of hydrocarbon in the target range of densities. First, we determine which variables to use as predictors and how to model the response variable. We then compute the correlation matrix to obtain an initial idea of the relationships between the variables. Machine learning models are often used for feature ranking, since some models rank features internally or make it easy to extract a ranking from the structure of the fitted model. In summary, we will explore regularisation methods, random forests, gradient boosting machines (GBMs) and partial least squares regression (PLS).
The correlation matrix is used to determine if a relationship exists between the variables. It consists of the Pearson correlation coefficients for a set of variables and shows the degree of the relationship as well as the direction. We will also calculate the significance levels (p-values) to identify whether the correlation between variables is significant.
| | feature | response | cor | p-value |
|---|---|---|---|---|
| 232 | temp | sum | 0.08 | 0.08 |
| 233 | time | sum | 0.18 | 0.00 |
| 243 | 10 | sum | 0.09 | 0.05 |
| 244 | 11 | sum | 0.06 | 0.18 |
| 245 | 12 | sum | 0.41 | 0.00 |
| 246 | 13 | sum | 0.52 | 0.00 |
| 247 | 14 | sum | 0.38 | 0.00 |
| 248 | 15 | sum | 0.22 | 0.00 |
| 249 | 16 | sum | 0.06 | 0.20 |
The plot shows the correlation between the features and the response variable (sum). The table summarises the most important coefficients together with their significance levels. There is a strong correlation between the sum, the residence time and the feed mixtures in intervals 12 to 15; their p-values are below the 0.05 significance level, indicating that these correlation coefficients are significant. There is a low degree of correlation between the response variable and temperature, and the p-value indicates that this coefficient is not significant at the 5% level.
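For reference, a minimal sketch (assumed, not the original chunk) of how such a correlation table could be built, where the response is the sum of the out-fractions in the target intervals:

#Correlation with the target-range yield (hedged sketch, assumed code)
resp <- hydro$out_fraction_13 + hydro$out_fraction_14 + hydro$out_fraction_15
feats <- hydro %>% select(temperature, through_time, starts_with("feed_fraction"))
cor_tab <- purrr::map_dfr(names(feats), function(v) {
  ct <- cor.test(feats[[v]], resp)  # Pearson correlation and p-value
  tibble(row = v, column = "sum",
         cor = round(unname(ct$estimate), 2), p = round(ct$p.value, 2))
})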
Before moving on to the methods used, we need to define the response variable. As mentioned above, the hydrocarbons of interest lie in density intervals 13, 14 and 15. This study assumes that the effluent fractions do not interact with each other, and consequently the remaining effluent fractions are omitted from the analysis. We want to find the combination of variables that results in the maximum yield in intervals 13, 14 and 15, so our response variable is the sum of these mass proportions. The features, or predictors, are the variables catalyst, temperature, residence time and the proportions of the feed mixture in the different intervals. Moreover, we split the data, using 80% for training and 20% for testing. To reduce the variance further, we use 10-fold cross-validation.
#Prepare Training And Test Data----------------------------------------------------------------------------
#Create response variable
hydro<-hydro %>%
mutate(sum_int = out_fraction_13+out_fraction_14+out_fraction_15)
#Define data and remove features we don't need
data<-hydro %>% select(-starts_with('out'))
#Define training indexes
trainingIndex <- sample(1:nrow(data), 0.8*nrow(data)) # indices for 80% training data
#Define training data
data.train<-data[trainingIndex, ]
#Define testing data
data.test <- data[-trainingIndex, ]
Regularization is a method that reduces overfitting by adding a penalty to the model. The most popular regularization methods are lasso and ridge regression, which use a penalty term \(\lambda\) to regularize the coefficients, so that the optimisation function is penalised if the coefficients are too large. The most important features will have the highest coefficients in the model, while features that do not affect the result as much should have coefficient values close to zero.
In this approach, we implement an elastic net regression model, which is a combination of the two methods above and incorporates both penalties. This is achieved through the mixing parameter \(\alpha\), which determines the degree of mixing between ridge regression and lasso: when \(\alpha=0\) the model corresponds to ridge regression, and when \(\alpha=1\) it corresponds to lasso. Values of \(\alpha\) in between result in a blend of the two.
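For reference, the objective minimised by the elastic net (in the standard glmnet formulation for a Gaussian response) can be written as

\[
\min_{\beta_0,\,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
\;+\;\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right],
\]

so \(\alpha\) controls the mix of the ridge and lasso penalties and \(\lambda\) controls their overall strength.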
To determine the optimal value of \(\alpha\), we test different values and choose the one that results in the minimum mean squared error (MSE). In addition, we use 10-fold cross-validation to determine the optimal value of \(\lambda\) for each value of \(\alpha\). The parameter \(\lambda\) is called the shrinkage parameter: when \(\lambda=0\) no shrinkage is performed, and as \(\lambda\) increases the coefficients are shrunk more strongly, regardless of the value of \(\alpha\). We use the \(\lambda\) that gives the model with the minimum cross-validation error, and we test values of \(\alpha\) ranging from 0 to 1 in 100 steps of size 0.01.
##Regularization----------------------------------------------------------------------------
#Create observations and target variable for training data
x.train <- data.train %>% select(-sum_int) %>% data.matrix
y.train <- data.train$sum_int  # training data
#Create observations and target variable for testing data
x.test <- data.test %>% select(-sum_int) %>% data.matrix
y.test <- data.test$sum_int
#Create empty list used to save fits
list.of.fits <- list()
#number of iterations
it<-100
## We are testing alpha = i/100.
for (i in 0:it) {
## Create variable name
fit.name <- paste0("alpha", i/it)
## Fit a model and store it in a list
list.of.fits[[fit.name]] <-
cv.glmnet(x.train, y.train, type.measure="mse", alpha=i/it,
family="gaussian")
}
## Find which alpha results in the min MSE
reg.results <- data.frame()
for (i in 0:it) {
fit.name <- paste0("alpha", i/it)
## Use each model to calculate predictions given the Testing dataset
predicted <-
predict(list.of.fits[[fit.name]],
s=list.of.fits[[fit.name]]$lambda.min, newx=x.test)
## Calculate the Mean Squared Error
mse <- mean((y.test - predicted)^2)
## Store the results
temp <- data.frame(alpha=i/it, mse=mse, fit.name=fit.name)
reg.results <- rbind(reg.results, temp)
}
#See results------Uncomment to see table
#reg.results%>% knitr::kable("html") %>%
# kableExtra::kable_styling(latex_options = "hold_position")
#Plot mse and alpha value
ggplot(reg.results,aes(x=alpha,y=mse))+
theme_bw()+
geom_line(color="#1F3552")+
ggtitle("MSE for different alpha values")+
ylab("MSE")
#Lasso model----------------------------------------------------------------------------
#Create vector with lambdas
lambdas <- 10^seq(2, -3, by = -.1)
# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x.train, y.train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 10)
# Best model
lambda_best <- lasso_reg$lambda.min
lasso_model <- glmnet(x.train, y.train, alpha = 1, lambda = lambda_best, standardize = TRUE)
#Save coefficients
coef.lasso<- coef(lasso_model)
#Compare coefficients with Lasso model
cbind(coef.reg, coef.lasso)
## 24 x 2 sparse Matrix of class "dgCMatrix"
## s1 s0
## (Intercept) -9.732765e-03 -2.046958e-03
## catalyst 3.781700e-03 3.335440e-03
## temperature 8.735192e-06 6.450328e-06
## through_time 7.195076e-04 5.454243e-04
## feed_fraction_1 -1.347089e-02 -2.476124e-03
## feed_fraction_2 . .
## feed_fraction_3 . .
## feed_fraction_4 . .
## feed_fraction_5 . .
## feed_fraction_6 . .
## feed_fraction_7 . .
## feed_fraction_8 . .
## feed_fraction_9 . .
## feed_fraction_10 6.547093e-02 3.528340e-02
## feed_fraction_11 1.111013e-01 6.440609e-02
## feed_fraction_12 5.063042e-01 4.707758e-01
## feed_fraction_13 8.412093e-01 7.983269e-01
## feed_fraction_14 6.890264e-01 6.338400e-01
## feed_fraction_15 3.930661e-01 3.459051e-01
## feed_fraction_16 . .
## feed_fraction_17 . .
## feed_fraction_18 . .
## feed_fraction_19 . .
## feed_fraction_20 . .
#Create data frame
imp.coef_las<-data.frame(coef=coef.lasso[coef.lasso[,1]!=0,])
#Convert row names to column
imp.coef_las<-rownames_to_column(imp.coef_las, "variables")
imp.coef_las$variables<-str_replace(imp.coef_las$variables,"feed_fraction_", "")
#Plot important lasso coefficients
las_coef <- ggplot(imp.coef_las, aes(x = forcats::fct_reorder(.f = variables, .x = coef),
                                     y = coef)) +
  geom_col(fill = "#1F3552") +
  coord_flip() +
  theme_bw() +
  xlab('') +
  ylab('Coefficient') +
  ggtitle("Lasso coefficients")
We now consider the random forest method, which can be used both for prediction and for estimating feature importance. The method builds multiple decision trees and combines the resulting predictions to create a more accurate model. Random forests perform an internal split of the data, since each tree is trained on a different bootstrap sample drawn from the original data. They are often used to find the variables with the most predictive power, by measuring node purity and how much the accuracy decreases when a variable is excluded. Random forests can overfit, so we need to find the best hyperparameters for our model by iterating through different options. Once this is done, we train the best model and find the most important features based on the mean decrease in impurity. We then remove features from the bottom of the list and evaluate the impact on performance.
##Random forest -----------------------------------------------------------------------------------------------------------------------
#Evaluate different values for ntree.
#Create empty data frame for saving the number of trees and the corresponding error
modellist <- data.frame()
#Loop through different number of trees
for (ntree in c(500,1000, 1500, 2000, 2500)) {
set.seed(1)
key <- toString(ntree)
fit <- randomForest(sum_int ~ .,
data=data,
ntree=ntree,
importance=TRUE,
xtest = data.frame(x.test),
ytest = y.test)
modellist<- rbind(modellist,c(ntree,sqrt(fit$mse[which.min(fit$mse)])))
}
#Format column names (the loop stores the minimum RMSE for each forest)
colnames(modellist) <- c('ntrees','RMSE')
#Find best number of trees
best.ntree <- modellist[which.min(modellist$RMSE), 1]
#Train best model
rf.model0 <- randomForest(sum_int ~ .,
data=data,
ntree=best.ntree,
importance=TRUE,
xtest = data.frame(x.test),
ytest = y.test)
Next, we choose the best features according to the model above and train a new model, as sketched below.
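A sketch of this step, assuming the top predictors are taken from the permutation importance (%IncMSE) of rf.model0 and that the cut-off of eight variables is illustrative (the original chunk is not shown):

#Reduced random forest (hedged sketch, assumed code)
imp <- randomForest::importance(rf.model0)
top_vars <- rownames(imp)[order(imp[, "%IncMSE"], decreasing = TRUE)][1:8]  # illustrative cut-off
rf.model1 <- randomForest(reformulate(top_vars, response = "sum_int"),
                          data = data.train,
                          ntree = best.ntree,
                          importance = TRUE)
rf.pred1 <- predict(rf.model1, newdata = data.test)
rf.mse1 <- mean((data.test$sum_int - rf.pred1)^2)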
| MSE | Rsquare | RMSE |
|---|---|---|
| 1.26e-05 | 99.99935 | 0.008171 |
As mentioned above, random forests build an ensemble of deep, independent trees, while gradient boosting machines (GBMs) train trees sequentially, meaning that each tree learns from and improves on the previous one. An important feature of GBMs is the variable importance measure, which ranks the feature variables by their importance in training the model. GBMs are also prone to overfitting, so we will need to find the best hyperparameters and train the model on the best set.
#Gradient Boosting Machine-----------------------------------------------------------------------------------------------------------------------
#Fit model
boost_feed <- gbm(sum_int ~ ., data=data.train,
distribution = "gaussian",
n.trees = 2000,
shrinkage = 0.01,
interaction.depth = 4,
cv.folds = 10)
#Summary
summary(boost_feed)
#Find the number of trees that results to the minimum square error
gbm.perf(boost_feed,method = 'cv')
## [1] 1901
which.min(boost_feed$cv.error)
## [1] 1901
The gbm defaults include a learning rate (shrinkage), which reduces the impact of each additional tree; here we use a learning rate of 0.01 to avoid overfitting. We fit 2000 trees with an interaction depth of 4 for each tree. Lastly, we specify a Gaussian distribution and perform 10-fold cross-validation. The results show that the MSE loss function is minimised with 1901 trees, which we use to compute our predictions.
#Compute predictions using the optimum number of trees
boost_pred <- predict(
  boost_feed,
  newdata = data.test, n.trees = which.min(boost_feed$cv.error))
## Using 1901 trees...
#Performance metrics
bg.mse<-sum((data.test$sum_int-boost_pred)^2)/length(data.test$sum_int)
bg.r2 <- (1 - bg.mse / mean((data.test$sum_int - mean(data.test$sum_int))^2)) * 100
bg.perf<-data.frame(MSE=bg.mse,
Rsquare=bg.r2,
RMSE= sqrt(mean((boost_pred - data.test$sum_int)^2)))
#Find influential variables
FeedEffects <- as_tibble(summary.gbm(boost_feed,plotit = FALSE))
#Format variables names
FeedEffects$var<-str_replace(FeedEffects$var,"feed_fraction_", "")
#Plot important variables
boost_coef<-FeedEffects %>%
# plot these data using columns
ggplot(aes(x = forcats::fct_reorder(.f = var,
.x = rel.inf),
y = rel.inf,
fill = rel.inf)) +
geom_col() +
# flip
coord_flip() +
# format
theme_bw() +
scale_color_brewer("Set1")+
theme(legend.position = "none")+
xlab('') +
ylab('Relative Influence') +
ggtitle("Gradient Boosting Machine")
# Create Data frame with predicted and actual values
boost.results<-data.frame(actual=data.test$sum_int,predicted=boost_pred)
# plot predicted v actual
ggplot(boost.results) +
geom_point(aes(y = predicted,
x = actual,
color = predicted - actual), alpha = 0.7) +
# add theme
theme_bw() +
# strip text
theme(axis.title = element_text()) +
# add axes/legend titles
scale_color_continuous(name = "Predicted - Actual") +
ylab('Predicted Effluent Mixture') +
xlab('Actual Effluent Mixture') +
ggtitle('Predicted vs Actual')
Partial least squares (PLS) regression is a method developed to handle possibly correlated feature variables and relatively few samples. PLS uses a dimension-reduction strategy that is supervised by the outcome: it identifies new components that describe as much as possible of the covariance between the features and the response variable, and these components are then used to fit the regression model. By comparison, principal component analysis (PCA) reduces the dimension of the data using components constructed as linear combinations of the original variables, without reference to the response. The number of PLS components is chosen by cross-validation.
##Partial Least Squares Regression --------------------------------------------------------------------------------------------
# Build the model on training set
pls.model <- train(
sum_int~., data = data.train, method = "pls",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
## Summarise the fitted model: the best tuning parameter ncomp is
## the one that minimises the cross-validation error (RMSE)
summary(pls.model)
## Data: X dimension: 397 24
## Y dimension: 397 1
## Fit method: oscorespls
## Number of components considered: 8
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 4.452 8.641 13.31 17.43 22.02 26.37 31.10
## .outcome 78.100 81.853 82.35 82.45 82.46 82.46 82.46
## 8 comps
## X 35.38
## .outcome 82.46
# Plot model RMSE vs different values of components
pc_perf<-data.frame(ncomp=pls.model$results$ncomp, rmse=pls.model$results$RMSE, rsquared=pls.model$results$Rsquared)
rmse.pls<-ggplot(pc_perf,aes(x = ncomp,y = rmse)) +
geom_line()+
# format
scale_color_brewer(palette = "Dark2") +
theme_bw() +
scale_x_continuous(n.breaks = 10)+
xlab('Number of Components') +
ylab('RMSE (cross-validation)') +
ggtitle("RMSE for each component")
rsquared.pls<-ggplot(pc_perf,aes(x = ncomp,y = rsquared)) +
geom_line()+
# format
scale_color_brewer(palette = "Dark2") +
theme_bw() +
scale_x_continuous(n.breaks = 10)+
xlab('Number of Components') +
ylab('Rsquared') +
ggtitle("Rsquared for each component")
grid.arrange(rmse.pls,rsquared.pls,ncol=2)
# Compute predictions
pls.pred <- predict(pls.model,data.test)
# Model performance metrics
pls.mse<-sum((data.test$sum_int-pls.pred)^2)/length(data.test$sum_int)
pls.perf<-data.frame(
MSE=pls.mse,
Rsquare = caret::R2(pls.pred, data.test$sum_int)*100,
RMSE = caret::RMSE(pls.pred, data.test$sum_int))
For this last method, we again scale our data and use 10-fold cross-validation to evaluate the performance of the model. Once the model is fitted, we determine the number of PLS components to keep. Looking at figure 7, we see that beyond 2 components there is no significant change in the RMSE or \(R^2\), so we could use only the first two components for the optimal model.
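As a check, a model restricted to two components can be fitted directly by fixing the tuning grid (a sketch, not part of the original analysis):

#PLS with two components only (hedged sketch, assumed code)
pls.model2 <- train(
  sum_int ~ ., data = data.train, method = "pls",
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneGrid = data.frame(ncomp = 2)
)
pls.pred2 <- predict(pls.model2, data.test)
caret::RMSE(pls.pred2, data.test$sum_int)  # compare with the full model's RMSE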
The plots below present the feature importance as computed by all methods. It is interesting to note that, for every method used in this study, the most influential variable is the feed hydrocarbon mixture in density interval 13. Another important finding is that the feed mixtures in density intervals 12 and 14 are the second and third most influential, followed very closely by interval 15. It is somewhat surprising that the random forests rank the catalyst as more influential than the feed mixture in interval 15, as this contradicts the results of the other methods. In addition, elastic net and lasso regression identify the feed mixtures in intervals 10 and 11 as more important than the catalyst, residence time and temperature; however, the lasso model still considers those variables important and did not shrink their coefficients to zero. In comparison, the tree-based methods, i.e. random forests and GBMs, consider the reactor settings significant. Moreover, temperature does not affect the results significantly, which confirms our observations in the exploratory analysis. All methods show that the choice of catalyst is more important than the residence time and temperature. Finally, we can say with some certainty that the feed mixtures far from the intervals of interest are not influential.
### Final metrics --------------------------------------------------------------------------------------------
#Important Features
grid.arrange(cor.imp,reg_coef,las_coef,rf_coef0,rf_coef1,boost_coef,nrow=3,ncol=2)
We now turn to the prediction performance of each model. The graphs below show the mean squared error and \(R^2\) for every method, and the table on the right shows the exact metrics. We observe that the random forest model trained on all the variables achieves the minimum mean squared error, MSE = \(1.21 \times 10^{-5}\), and the best \(R^2\), equal to 99.9%. The next best model is the random forest trained on the important variables only, with MSE = \(1.25 \times 10^{-5}\) and \(R^2\) = 99.9%. The partial least squares regression has a small MSE of \(3.81 \times 10^{-5}\) but a lower \(R^2\) of 82%. The elastic net regression has a low \(R^2\) of 80.18% and a fairly high MSE of \(3.83 \times 10^{-5}\), followed by the lasso model with an MSE of \(4.23 \times 10^{-5}\) and an \(R^2\) of 78.10%. Finally, the GBM model has the highest MSE, equal to \(4.736 \times 10^{-5}\), despite a high \(R^2\).
| Methods | MSE | Rsquare | RMSE |
|---|---|---|---|
| ElasticNet | 3.83e-05 | 80.18120 | 0.0061899 |
| Lasso | 4.23e-05 | 78.10281 | 0.0065064 |
| RF0 | 1.18e-05 | 99.99939 | 0.0086534 |
| RF1 | 1.26e-05 | 99.99935 | 0.0081710 |
| GBM | 4.59e-05 | 99.99939 | 0.0067778 |
| PLS | 3.81e-05 | 80.65886 | 0.0061739 |